Distributed Computing in a Pandemic: A Review of Technologies Available for Tackling COVID-19
The current COVID-19 global pandemic caused by the SARS-CoV-2 betacoronavirus has resulted in over a million deaths and is having a grave socio-economic impact, hence there is an urgency to find solutions to key research challenges. Much of this COVID-19 research depends on distributed computing. In this article, I review distributed architectures -- various types of clusters, grids and clouds -- that can be leveraged to perform these tasks at scale, at high-throughput, with a high degree of parallelism, and which can also be used to work collaboratively. High-performance computing (HPC) clusters will be used to carry out much of this work. Several big-data processing tasks used in reducing the spread of SARS-CoV-2 require high-throughput approaches and a variety of tools, which Hadoop and Spark offer, even on commodity hardware. Extremely large-scale COVID-19 research has also utilised some of the world's fastest supercomputers, such as IBM's SUMMIT -- for ensemble-docking high-throughput screening against SARS-CoV-2 targets for drug repurposing, and for high-throughput gene analysis -- and Sentinel, an XPE-Cray-based system used to explore natural products. Grid computing has facilitated the formation of the world's first Exascale grid computer, which has accelerated COVID-19 research in molecular dynamics simulations of SARS-CoV-2 spike protein interactions through massively parallel computation performed on over 1 million volunteer computing devices using the Folding@home platform. Both grids and clouds can also be used for international collaboration by enabling access to important datasets and providing services that allow researchers to focus on research rather than on time-consuming data-management tasks.
Comment: 21 pages (15 excl. refs), 2 figures, 3 tables
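To make the Hadoop/Spark point concrete, the sketch below shows the classic map/reduce counting pattern (here, k-mer counting across many genome sequences) written for PySpark, the kind of high-throughput job the review says can run on commodity clusters. The HDFS paths, input layout and k-mer length are illustrative assumptions, not details taken from the paper.

```python
# Minimal PySpark sketch of the commodity-cluster, high-throughput processing
# the review describes (not code from the paper itself).
# Paths, the input layout and the k-mer size are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sars-cov-2-kmer-counts").getOrCreate()
sc = spark.sparkContext

# Each input file holds one genome sequence per line (hypothetical layout).
genomes = sc.textFile("hdfs:///data/sars-cov-2/sequences/*.txt")

K = 21  # illustrative k-mer length

def kmers(seq):
    seq = seq.strip().upper()
    return [seq[i:i + K] for i in range(len(seq) - K + 1)]

# Classic map/reduce word-count pattern, applied to k-mers across all genomes.
counts = (genomes
          .flatMap(kmers)
          .map(lambda k: (k, 1))
          .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///results/kmer-counts")
spark.stop()
```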
The application of Hadoop in structural bioinformatics
The paper reviews the use of the Hadoop platform in structural bioinformatics applications. For structural bioinformatics, Hadoop provides a new framework to analyse large fractions of the Protein Data Bank that is key for high-throughput studies of, for example, protein-ligand docking, clustering of protein-ligand complexes and structural alignment. Specifically, we review a number of implementations in the literature that use Hadoop for high-throughput analyses, and their scalability. We find that these deployments for the most part use known executables called from MapReduce rather than rewriting the algorithms. The scalability exhibits variable behaviour in comparison with other batch schedulers, particularly as direct comparisons on the same platform are generally not available. Direct comparisons of Hadoop with batch schedulers are absent in the literature, but we note there is some evidence that Message Passing Interface implementations scale better than Hadoop. A significant barrier to the use of the Hadoop ecosystem is the difficulty of the interface and configuration of a resource to use Hadoop. This will improve over time as interfaces to Hadoop (e.g. Spark) improve, usage of cloud platforms (e.g. Azure and Amazon Web Services (AWS)) increases, and standardised approaches such as workflow languages (i.e. Workflow Definition Language, Common Workflow Language and Nextflow) are taken up.
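A minimal sketch of the pattern the review highlights -- calling a known executable from a map step rather than rewriting the algorithm -- is given below using Spark's RDD.pipe(); the script name and HDFS paths are hypothetical placeholders, and Hadoop Streaming offers the same pattern on plain MapReduce.

```python
# Sketch of the pattern the review identifies: wrapping an existing
# executable in a map step rather than re-implementing the algorithm.
# The tool path and file locations are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("executable-per-record").getOrCreate()
sc = spark.sparkContext

# One PDB entry identifier per line (hypothetical input file).
pdb_ids = sc.textFile("hdfs:///data/pdb_ids.txt")

# RDD.pipe() streams each element to the external tool's stdin and
# returns its stdout lines; the legacy tool itself is left unchanged.
results = pdb_ids.pipe("/opt/tools/score_structure.sh")

results.saveAsTextFile("hdfs:///results/structure_scores")
spark.stop()
```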
A Novel Method to Detect Bias in Short Read NGS Data
Detecting sources of bias in transcriptomic data is essential to determine signals of biological significance. We outline a novel method to detect sequence-specific bias in short-read Next Generation Sequencing data. This is based on determining intra-exon correlations between specific motifs. This requires a mild assumption that short reads sampled from specific regions of the same exon will be correlated with each other. This has been implemented on Apache Spark and used to analyse two D. melanogaster eye-antennal disc data sets generated at the same laboratory. The wild-type Drosophila data set indicates a variation due to motif GC content that is more significant than that found due to exon GC content. The software is available online and could be applied for cross-experiment transcriptome data analysis in eukaryotes.
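A simplified sketch of the intra-exon correlation idea, written for Apache Spark (PySpark), is shown below: reads are grouped by exon and the per-read counts of two motifs are correlated within each exon. The motif pair, the pre-joined (exon, read) input format and the minimum-read threshold are illustrative assumptions, not the paper's actual parameters.

```python
# Simplified sketch of the intra-exon motif-correlation idea: for each exon,
# correlate the per-read counts of two motifs across the reads mapped to it.
# Motif pair, input format (exon_id<TAB>read_sequence) and the 2-read minimum
# are illustrative assumptions.
from math import sqrt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intra-exon-motif-correlation").getOrCreate()
sc = spark.sparkContext

MOTIF_A, MOTIF_B = "GGC", "GCC"  # illustrative motif pair

def parse(line):
    exon_id, read = line.rstrip("\n").split("\t")
    return exon_id, (read.count(MOTIF_A), read.count(MOTIF_B))

def pearson(pairs):
    xs, ys = zip(*pairs)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

reads = sc.textFile("hdfs:///data/reads_by_exon.tsv").map(parse)

correlations = (reads
                .groupByKey()                        # all (countA, countB) pairs per exon
                .mapValues(list)
                .filter(lambda kv: len(kv[1]) >= 2)  # need at least two reads per exon
                .mapValues(pearson))

correlations.saveAsTextFile("hdfs:///results/exon_motif_correlations")
spark.stop()
```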
PDB-Hadoop
This is the alpha release of the PDB-Hadoop framework. The framework, developed by Jamie Alnasir and Hugh Shanahan at Royal Holloway, University of London, facilitates the parallel execution of protein structure analysis tools over the entire Protein Data Bank (PDB), or subsets of it, using the Apache Hadoop platform. The framework is designed so that structural biologists can use the Hadoop platform without having to explicitly write Hadoop code. The framework is easily scalable and uses a mapper architecture that functions on a stand-alone basis or can be extended to include further MapReduce operations.
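The mapper architecture can be illustrated with a Hadoop Streaming-style mapper that receives PDB entry identifiers on stdin and runs an unmodified analysis executable on each. This is a hedged sketch of the general pattern, not PDB-Hadoop's actual code or interface; the tool path, local PDB mirror location and output format are hypothetical.

```python
#!/usr/bin/env python3
# Illustrative Hadoop Streaming-style mapper in the spirit of PDB-Hadoop's
# mapper architecture: each input line names a PDB entry, and the mapper runs
# an unmodified analysis tool on it. The tool path, PDB mirror location and
# output format are hypothetical, not PDB-Hadoop's actual interface.
import subprocess
import sys

PDB_MIRROR = "/mnt/pdb"                 # assumed local mirror of the PDB
TOOL = "/opt/tools/analyse_structure"   # assumed user-supplied executable

for line in sys.stdin:
    pdb_id = line.strip()
    if not pdb_id:
        continue
    result = subprocess.run(
        [TOOL, f"{PDB_MIRROR}/{pdb_id}.pdb"],
        capture_output=True, text=True)
    # Emit tab-separated key/value records for any downstream reduce step.
    print(f"{pdb_id}\t{result.stdout.strip()}")
```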
SRA SQL metadata
This is a set of SQL databases downloaded from the SRA (Sequence Read Archive) database.
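Assuming the databases are distributed as SQLite files, querying them might look like the sketch below; the filename, table and column names are hypothetical and should be checked against the actual schema (e.g. via SELECT name FROM sqlite_master).

```python
# Minimal sketch of querying one of the downloaded SRA metadata databases,
# assuming it is an SQLite file. The filename, table ("run") and columns
# ("library_strategy", "run_accession") are hypothetical placeholders.
import sqlite3

conn = sqlite3.connect("sra_metadata.sqlite")
cur = conn.cursor()

# Count runs per library strategy (assumed schema).
cur.execute("""
    SELECT library_strategy, COUNT(run_accession)
    FROM run
    GROUP BY library_strategy
""")
for strategy, n_runs in cur.fetchall():
    print(strategy, n_runs)

conn.close()
```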